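Data visualization is one of the cores of data science. Alongside expressive and communication skills it also calls for basic technique, and it is an important part of programming in R. These notes accumulate examples while learning step by step. The references used are listed at the end.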
# message = FALSE
library(tidyverse)
## ─ Attaching packages ───────────────────────────── tidyverse 1.3.0 ─
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ─ Conflicts ─────────────────────────────── tidyverse_conflicts() ─
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Plots with base R. For reference, the equivalent ggplot2 plot is shown alongside.
The examples that come with each Help page are also examined.
Two-dimensional plots. In base R, two vectors are enough. In ggplot2, you normally use two columns of a data frame.
plot(x, y, ...)
Assign the two vectors to x and y.
plot(mtcars$wt, mtcars$mpg)
# ggplot2
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
Passing vectors as arguments, as in the code below, also works, but it is not recommended.
ggplot(data = NULL, aes(x = mtcars$wt, y = mtcars$mpg)) +
  geom_point()
plot(x, y, type = "l")
Add a second series to the graph.
plot(pressure$temperature, pressure$pressure, type = "l")
points(pressure$temperature, pressure$pressure)
lines(pressure$temperature, pressure$pressure/2, col = "red")
points(pressure$temperature, pressure$pressure/2, col = "red")
In ggplot2, layers are added one after another.
ggplot(pressure) +
  geom_line(aes(x = temperature, y = pressure), color = "black") +
  geom_point(aes(x = temperature, y = pressure), color = "black") +
  geom_line(aes(x = temperature, y = pressure/2), color = "red") +
  geom_point(aes(x = temperature, y = pressure/2), color = "red")
Another approach is to reshape the data into tidy form and use group.
# Original series, labelled "A"
pressure1 <- pressure %>% mutate(type = "A")
# Halved series, labelled "B", kept under the same column names
pressure2 <- pressure %>% mutate(pressure = pressure / 2, type = "B")
pressure0 <- bind_rows(pressure1, pressure2)
# ggplot2
ggplot(pressure0, aes(x = temperature, y = pressure, group = type)) +
  geom_line(aes(color = type)) +
  geom_point(aes(color = type))
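The same long-format table can also be built in one step with tidyr::pivot_longer(). The following is a minimal sketch, not the approach used above; the names `A`, `B`, and `pressure_long` are chosen here for illustration only.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical column names: A holds the original pressure, B the halved one.
pressure_long <- pressure %>%
  mutate(A = pressure, B = pressure / 2) %>%
  select(temperature, A, B) %>%
  pivot_longer(cols = c(A, B), names_to = "type", values_to = "pressure")

# Mapping color to type also groups the lines by type.
p <- ggplot(pressure_long, aes(x = temperature, y = pressure, color = type)) +
  geom_line() +
  geom_point()
print(p)
```

Here the grouping follows from the color mapping, so a separate group aesthetic is not needed.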
methods(plot)
## [1] plot,ANY-method plot,color-method plot.acf*
## [4] plot.ACF* plot.augPred* plot.compareFits*
## [7] plot.data.frame* plot.decomposed.ts* plot.default
## [10] plot.dendrogram* plot.density* plot.ecdf
## [13] plot.factor* plot.formula* plot.function
## [16] plot.ggplot* plot.gls* plot.gtable*
## [19] plot.hcl_palettes* plot.hclust* plot.histogram*
## [22] plot.HoltWinters* plot.intervals.lmList* plot.isoreg*
## [25] plot.lm* plot.lme* plot.lmList*
## [28] plot.medpolish* plot.mlm* plot.nffGroupedData*
## [31] plot.nfnGroupedData* plot.nls* plot.nmGroupedData*
## [34] plot.pdMat* plot.ppr* plot.prcomp*
## [37] plot.princomp* plot.profile.nls* plot.R6*
## [40] plot.ranef.lme* plot.ranef.lmList* plot.raster*
## [43] plot.shingle* plot.simulate.lme* plot.spec*
## [46] plot.stepfun plot.stl* plot.table*
## [49] plot.trans* plot.trellis* plot.ts
## [52] plot.tskernel* plot.TukeyHSD* plot.Variogram*
## see '?methods' for accessing help and source code
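Most entries in this list are S3 methods that plot() dispatches to according to the class of its first argument. As a rough sketch of that mechanism, the following defines a method for a made-up class; the class name `reading` and the constructor `new_reading()` are hypothetical and do not belong to any package.

```r
# Hypothetical constructor for an illustrative S3 class "reading".
new_reading <- function(time, value) {
  structure(list(time = time, value = value), class = "reading")
}

# plot() is generic, so defining plot.reading() is enough for dispatch.
plot.reading <- function(x, ...) {
  plot(x$time, x$value, type = "b", xlab = "time", ylab = "value", ...)
  invisible(x)
}

r <- new_reading(BOD$Time, BOD$demand)
plot(r, main = "plot() dispatching on class 'reading'")
```

Calling plot(r) reaches plot.reading(), which in turn calls plot() on two numeric vectors and so ends up in plot.default.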
Arguments
x
the coordinates of points in the plot. Alternatively, a single plotting structure, function or any R object with a plot method can be provided.
y
the y coordinates of points in the plot, optional if x is an appropriate structure.
...
Arguments to be passed to methods, such as graphical parameters (see par). Many methods will accept the following arguments:
type
what type of plot should be drawn. Possible types are
"p" for points,
"l" for lines,
"b" for both,
"c" for the lines part alone of "b",
"o" for both ‘overplotted’,
"h" for ‘histogram’ like (or ‘high-density’) vertical lines,
"s" for stair steps,
"S" for other steps, see ‘Details’ below,
"n" for no plotting.
All other types give a warning or an error; using, e.g., type = "punkte" being equivalent to type = "p" for S compatibility. Note that some methods, e.g. plot.factor, do not accept this.
main
an overall title for the plot: see title.
sub
a sub title for the plot: see title.
xlab
a title for the x axis: see title.
ylab
a title for the y axis: see title.
asp
the y/x aspect ratio, see plot.window.
Details
The two step types differ in their x-y preference: Going from (x1,y1) to (x2,y2) with x1 < x2, type = "s" moves first horizontal, then vertical, whereas type = "S" moves the other way around.
See Also
plot.default, plot.formula and other methods; points, lines, par. For thousands of points, consider using smoothScatter() instead of plot().
For X-Y-Z plotting see contour, persp and image.
Note:
As the methods(plot) output shows, there is a very large number of methods.
example(plot)
##
## plot> require(stats) # for lowess, rpois, rnorm
##
## plot> plot(cars)
##
## plot> lines(lowess(cars))
##
## plot> plot(sin, -pi, 2*pi) # see ?plot.function
##
## plot> ## Discrete Distribution Plot:
## plot> plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
## plot+ main = "rpois(100, lambda = 5)")
##
## plot> ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
## plot> plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
##
## plot> points(x, cex = .5, col = "dark red")
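Two pieces of that output may deserve a closer look: lowess() returns smoothed coordinates that lines() can draw, and type = "s" and type = "S" differ only in whether the step moves horizontally or vertically first. A small sketch, with the toy vectors `xs` and `ys` invented here for illustration:

```r
# lowess() returns a list with components x and y (the smoothed values).
fit <- lowess(cars)
plot(cars)
lines(fit, col = "blue")

# Toy data (hypothetical) to compare the two step types side by side.
xs <- 1:5
ys <- c(2, 4, 3, 5, 1)
op <- par(mfrow = c(1, 2))
plot(xs, ys, type = "s", main = 'type = "s"')  # horizontal first, then vertical
points(xs, ys, cex = 0.5, col = "dark red")
plot(xs, ys, type = "S", main = 'type = "S"')  # vertical first, then horizontal
points(xs, ys, cex = 0.5, col = "dark red")
par(op)
```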
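lowess() (a scatter plot smoother based on locally weighted polynomial regression) computes a smooth curve through the points, and lines() draws it as connected line segments.
plot(x, type = "s"): 47 samples from the standard normal distribution (mean 0, standard deviation 1) are sorted in ascending order and plotted as stair steps with a main title. points() then overlays the points in dark red; because cex = .5, they are drawn at half the default size.
barplot(height, ...)
Give a vector of bar heights. To specify the bar labels, use names.arg.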
barplot(BOD$demand)
barplot(BOD$demand, names.arg = BOD$Time)
Use table() to count how often each value occurs in a vector, and draw the counts as a bar chart.
# cyl = number of cylinders
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
barplot(table(mtcars$cyl))
In ggplot2, use geom_col().
# ggplot2
ggplot(BOD, aes(x = Time, y = demand)) +
  geom_col()
To treat the variable x as a discrete value, convert it to a factor.
# ggplot2
ggplot(BOD, aes(x = factor(Time), y = demand)) +
  geom_col()
geom_bar() plots the number of rows in each category. Here x is treated as a continuous value.
ggplot(mtcars, aes(x = cyl)) +
  geom_bar()
Bar chart of counts, with x as a factor (category).
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()
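geom_bar() counts the rows itself, whereas geom_col() expects the bar heights to be present in the data. As a sketch of how the two relate, the counts can be computed first with dplyr::count() and then drawn with geom_col(); the object name `cyl_counts` is introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# count() adds a column n holding the number of rows per value of cyl.
cyl_counts <- mtcars %>% count(cyl)
print(cyl_counts)

# Should draw the same bars as geom_bar() applied to the raw data.
p <- ggplot(cyl_counts, aes(x = factor(cyl), y = n)) +
  geom_col()
print(p)
```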
## Default S3 method:
barplot(height, width = 1, space = NULL,
names.arg = NULL, legend.text = NULL, beside = FALSE,
horiz = FALSE, density = NULL, angle = 45,
col = NULL, border = par("fg"),
main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
axes = TRUE, axisnames = TRUE,
cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
add = FALSE, ann = !add && par("ann"), args.legend = NULL, ...)
## S3 method for class 'formula'
barplot(formula, data, subset, na.action,
horiz = FALSE, xlab = NULL, ylab = NULL, ...)
Arguments
height
either a vector or matrix of values describing the bars which make up the plot. If height is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If height is a matrix and beside is FALSE then each bar of the plot corresponds to a column of height, with the values in the column giving the heights of stacked sub-bars making up the bar. If height is a matrix and beside is TRUE, then the values in each column are juxtaposed rather than stacked.
width
optional vector of bar widths. Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless xlim is specified.
space
the amount of space (as a fraction of the average bar width) left before each bar. May be given as a single number or one number per bar. If height is a matrix and beside is TRUE, space may be specified by two numbers, where the first is the space between bars in the same group, and the second the space between the groups. If not given explicitly, it defaults to c(0,1) if height is a matrix and beside is TRUE, and to 0.2 otherwise.
names.arg
a vector of names to be plotted below each bar or group of bars. If this argument is omitted, then the names are taken from the names attribute of height if this is a vector, or the column names if it is a matrix.
legend.text
a vector of text used to construct a legend for the plot, or a logical indicating whether a legend should be included. This is only useful when height is a matrix. In that case given legend labels should correspond to the rows of height; if legend.text is true, the row names of height will be used as labels if they are non-null.
beside
a logical value. If FALSE, the columns of height are portrayed as stacked bars, and if TRUE the columns are portrayed as juxtaposed bars.
horiz
a logical value. If FALSE, the bars are drawn vertically with the first bar to the left. If TRUE, the bars are drawn horizontally with the first at the bottom.
density
a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of NULL means that no shading lines are drawn. Non-positive values of density also inhibit the drawing of shading lines.
angle
the slope of shading lines, given as an angle in degrees (counter-clockwise), for the bars or bar components.
col
a vector of colors for the bars or bar components. By default, grey is used if height is a vector, and a gamma-corrected grey palette if height is a matrix.
border
the color to be used for the border of the bars. Use border = NA to omit borders. If there are shading lines, border = TRUE means use the same colour for the border as for the shading lines.
main,sub
overall and sub title for the plot.
xlab
a label for the x axis.
ylab
a label for the y axis.
xlim
limits for the x axis.
ylim
limits for the y axis.
xpd
logical. Should bars be allowed to go outside region?
log
string specifying if axis scales should be logarithmic; see plot.default.
axes
logical. If TRUE, a vertical (or horizontal, if horiz is true) axis is drawn.
axisnames
logical. If TRUE, and if there are names.arg (see above), the other axis is drawn (with lty = 0) and labeled.
cex.axis
expansion factor for numeric axis labels.
cex.names
expansion factor for axis names (bar labels).
inside
logical. If TRUE, the lines which divide adjacent (non-stacked!) bars will be drawn. Only applies when space = 0 (which it partly is when beside = TRUE).
plot
logical. If FALSE, nothing is plotted.
axis.lty
the graphics parameter lty applied to the axis and tick marks of the categorical (default horizontal) axis. Note that by default the axis is suppressed.
offset
a vector indicating how much the bars should be shifted relative to the x axis.
add
logical specifying if bars should be added to an already existing plot; defaults to FALSE.
ann
logical specifying if the default annotation (main, sub, xlab, ylab) should appear on the plot, see title.
args.legend
list of additional arguments to pass to legend(); names of the list are used as argument names. Only used if legend.text is supplied.
formula
a formula where the y variables are numeric data to plot against the categorical x variables. The formula can have one of three forms:
y ~ x
y ~ x1 + x2
cbind(y1, y2) ~ x
See the examples.
data
a data frame (or list) from which the variables in formula should be taken.
subset
an optional vector specifying a subset of observations to be used.
na.action
a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
...
arguments to be passed to/from other methods. For the default method these can include further arguments (such as axes, asp and main) and graphical parameters (see par) which are passed to plot.window(), title() and axis.
Value
A numeric vector (or matrix, when beside = TRUE), say mp, giving the coordinates of all the bar midpoints drawn, useful for adding to the graph.
If beside is true, use colMeans(mp) for the midpoints of each group of bars, see example.
Author(s)
R Core, with a contribution by Arni Magnusson.
example(barplot)
##
## barplt> # Formula method
## barplt> barplot(GNP ~ Year, data = longley)
##
## barplt> barplot(cbind(Employed, Unemployed) ~ Year, data = longley)
##
## barplt> ## 3rd form of formula - 2 categories :
## barplt> op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))
##
## barplt> summary(d.Titanic <- as.data.frame(Titanic))
## Class Sex Age Survived Freq
## 1st :8 Male :16 Child:16 No :16 Min. : 0.00
## 2nd :8 Female:16 Adult:16 Yes:16 1st Qu.: 0.75
## 3rd :8 Median : 13.50
## Crew:8 Mean : 68.78
## 3rd Qu.: 77.00
## Max. :670.00
##
## barplt> barplot(Freq ~ Class + Survived, data = d.Titanic,
## barplt+ subset = Age == "Adult" & Sex == "Male",
## barplt+ main = "barplot(Freq ~ Class + Survived, *)", ylab = "# {passengers}", legend = TRUE)
##
## barplt> # Corresponding table :
## barplt> (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age=="Adult"))
## , , Sex = Male
##
## Class
## Survived 1st 2nd 3rd Crew
## No 118 154 387 670
## Yes 57 14 75 192
##
## , , Sex = Female
##
## Class
## Survived 1st 2nd 3rd Crew
## No 4 13 89 3
## Yes 140 80 76 20
##
##
## barplt> # Alternatively, a mosaic plot :
## barplt> mosaicplot(xt[,,"Male"], main = "mosaicplot(Freq ~ Class + Survived, *)", color=TRUE)
##
## barplt> par(op)
##
## barplt> # Default method
## barplt> require(grDevices) # for colours
##
## barplt> tN <- table(Ni <- stats::rpois(100, lambda = 5))
##
## barplt> r <- barplot(tN, col = rainbow(20))
##
## barplt> #- type = "h" plotting *is* 'bar'plot
## barplt> lines(r, tN, type = "h", col = "red", lwd = 2)
##
## barplt> barplot(tN, space = 1.5, axisnames = FALSE,
## barplt+ sub = "barplot(..., space= 1.5, axisnames = FALSE)")
##
## barplt> barplot(VADeaths, plot = FALSE)
## [1] 0.7 1.9 3.1 4.3
##
## barplt> barplot(VADeaths, plot = FALSE, beside = TRUE)
## [,1] [,2] [,3] [,4]
## [1,] 1.5 7.5 13.5 19.5
## [2,] 2.5 8.5 14.5 20.5
## [3,] 3.5 9.5 15.5 21.5
## [4,] 4.5 10.5 16.5 22.5
## [5,] 5.5 11.5 17.5 23.5
##
## barplt> mp <- barplot(VADeaths) # default
##
## barplt> tot <- colMeans(VADeaths)
##
## barplt> text(mp, tot + 3, format(tot), xpd = TRUE, col = "blue")
##
## barplt> barplot(VADeaths, beside = TRUE,
## barplt+ col = c("lightblue", "mistyrose", "lightcyan",
## barplt+ "lavender", "cornsilk"),
## barplt+ legend = rownames(VADeaths), ylim = c(0, 100))
##
## barplt> title(main = "Death Rates in Virginia", font.main = 4)
##
## barplt> hh <- t(VADeaths)[, 5:1]
##
## barplt> mybarcol <- "gray20"
##
## barplt> mp <- barplot(hh, beside = TRUE,
## barplt+ col = c("lightblue", "mistyrose",
## barplt+ "lightcyan", "lavender"),
## barplt+ legend = colnames(VADeaths), ylim = c(0,100),
## barplt+ main = "Death Rates in Virginia", font.main = 4,
## barplt+ sub = "Faked upper 2*sigma error bars", col.sub = mybarcol,
## barplt+ cex.names = 1.5)
##
## barplt> segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)
##
## barplt> stopifnot(dim(mp) == dim(hh)) # corresponding matrices
##
## barplt> mtext(side = 1, at = colMeans(mp), line = -2,
## barplt+ text = paste("Mean", formatC(colMeans(hh))), col = "red")
##
## barplt> # Bar shading example
## barplt> barplot(VADeaths, angle = 15+10*1:5, density = 20, col = "black",
## barplt+ legend = rownames(VADeaths))
##
## barplt> title(main = list("Death Rates in Virginia", font = 4))
##
## barplt> # Border color
## barplt> barplot(VADeaths, border = "dark blue")
##
## barplt> # Log scales (not much sense here)
## barplt> barplot(tN, col = heat.colors(12), log = "y")
##
## barplt> barplot(tN, col = gray.colors(20), log = "xy")
##
## barplt> # Legend location
## barplt> barplot(height = cbind(x = c(465, 91) / 465 * 100,
## barplt+ y = c(840, 200) / 840 * 100,
## barplt+ z = c(37, 17) / 37 * 100),
## barplt+ beside = FALSE,
## barplt+ width = c(465, 840, 37),
## barplt+ col = c(1, 2),
## barplt+ legend.text = c("A", "B"),
## barplt+ args.legend = list(x = "topleft"))
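Since barplot() returns the bar midpoints, as described under Value, they can be used to place text above each bar. A brief sketch using the BOD data from earlier; the variable name `mids` and the extra `ylim` headroom are choices made here.

```r
# barplot() returns the x coordinates of the bar midpoints.
mids <- barplot(BOD$demand, names.arg = BOD$Time,
                ylim = c(0, max(BOD$demand) * 1.15),
                xlab = "Time", ylab = "demand")

# pos = 3 places each label just above the given coordinate.
text(mids, BOD$demand, labels = BOD$demand, pos = 3)
```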
Pass a single vector as the argument.
hist(mtcars$mpg)
The number of bins is specified with breaks. A single number is treated only as a suggestion.
hist(mtcars$mpg, breaks = 10)
In ggplot2, map the variable to x and use geom_histogram(). The default is bins = 30.
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Adjust binwidth.
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 1)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 4)
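To make the ggplot2 histogram line up with the base one, one option is to reuse the breakpoints that hist() computed. This is only a sketch of that idea; `h` is a name introduced here, and it relies on the `breaks` argument of geom_histogram() taking precedence over `bins` and `binwidth`.

```r
library(ggplot2)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(mtcars$mpg, plot = FALSE)
print(h$breaks)

# Reuse the base R breakpoints as bin boundaries in ggplot2.
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(breaks = h$breaks, color = "black", fill = "grey80")
print(p)
```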
## Default S3 method:
hist(x, breaks = "Sturges",
freq = NULL, probability = !freq,
include.lowest = TRUE, right = TRUE,
density = NULL, angle = 45, col = NULL, border = NULL,
main = paste("Histogram of" , xname),
xlim = range(breaks), ylim = NULL,
xlab = xname, ylab,
axes = TRUE, plot = TRUE, labels = FALSE,
nclass = NULL, warn.unused = TRUE, ...)
Arguments
x
a vector of values for which the histogram is desired.
breaks
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only; as the breakpoints will be set to pretty values, the number is limited to 1e6 (with a warning if it was larger). If breaks is a function, the x vector is supplied to it as the only argument (and the number of breaks is only limited by the amount of available memory).
freq
logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).
probability
an alias for !freq, for S compatibility.
include.lowest
logical; if TRUE, an x[i] equal to the breaks value will be included in the first (or last, for right = FALSE) bar. This will be ignored (with a warning) unless breaks is a vector.
right
logical; if TRUE, the histogram cells are right-closed (left open) intervals.
density
the density of shading lines, in lines per inch. The default value of NULL means that no shading lines are drawn. Non-positive values of density also inhibit the drawing of shading lines.
angle
the slope of shading lines, given as an angle in degrees (counter-clockwise).
col
a colour to be used to fill the bars. The default of NULL yields unfilled bars.
border
the color of the border around the bars. The default is to use the standard foreground color.
main, xlab, ylab
main title and axis labels: these arguments to title() get “smart” defaults here, e.g., the default ylab is "Frequency" iff freq is true.
xlim, ylim
the range of x and y values with sensible defaults. Note that xlim is not used to define the histogram (breaks), but only for plotting (when plot = TRUE).
axes
logical. If TRUE (default), axes are drawn if the plot is drawn.
plot
logical. If TRUE (default), a histogram is plotted, otherwise a list of breaks and counts is returned. In the latter case, a warning is used if (typically graphical) arguments are specified that only apply to the plot = TRUE case.
labels
logical or character string. Additionally draw labels on top of bars, if not FALSE; see plot.histogram.
nclass
numeric (integer). For S(-PLUS) compatibility only, nclass is equivalent to breaks for a scalar or character argument.
warn.unused
logical. If plot = FALSE and warn.unused = TRUE, a warning will be issued when graphical parameters are passed to hist.default().
...
further arguments and graphical parameters passed to plot.histogram and thence to title and axis (if plot = TRUE).
Details
The definition of histogram differs by source (with country-specific biases). R's default with equi-spaced breaks (also the default) is to plot the counts in the cells defined by breaks. Thus the height of a rectangle is proportional to the number of points falling into the cell, as is the area provided the breaks are equally-spaced.
The default with non-equi-spaced breaks is to give a plot of area one, in which the area of the rectangles is the fraction of the data points falling in the cells.
If right = TRUE (default), the histogram cells are intervals of the form (a, b], i.e., they include their right-hand endpoint, but not their left one, with the exception of the first cell when include.lowest is TRUE.
For right = FALSE, the intervals are of the form [a, b), and include.lowest means ‘include highest’.
A numerical tolerance of 1e-7 times the median bin size (for more than four bins, otherwise the median is substituted) is applied when counting entries on the edges of bins. This is not included in the reported breaks nor in the calculation of density.
The default for breaks is "Sturges": see nclass.Sturges. Other names for which algorithms are supplied are "Scott" and "FD" / "Freedman-Diaconis" (with corresponding functions nclass.scott and nclass.FD). Case is ignored and partial matching is used. Alternatively, a function can be supplied which will compute the intended number of breaks or the actual breakpoints as a function of x.
Value
an object of class "histogram" which is a list with components:
breaks
the n+1 cell boundaries (= breaks if that was a vector). These are the nominal breaks, not with the boundary fuzz.
counts
n integers; for each cell, the number of x[] inside.
density
values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].
mids
the n cell midpoints.
xname
a character string with the actual x argument name.
equidist
logical, indicating if the distances between breaks are all the same.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Springer.
See Also
nclass.Sturges, stem, density, truehist in package MASS.
Typical plots with vertical bars are not histograms. Consider barplot or plot(*, type = "h") for such bar plots.
example(hist)
##
## hist> op <- par(mfrow = c(2, 2))
##
## hist> hist(islands)
##
## hist> utils::str(hist(islands, col = "gray", labels = TRUE))
## List of 6
## $ breaks : num [1:10] 0 2000 4000 6000 8000 10000 12000 14000 16000 18000
## $ counts : int [1:9] 41 2 1 1 1 1 0 0 1
## $ density : num [1:9] 4.27e-04 2.08e-05 1.04e-05 1.04e-05 1.04e-05 ...
## $ mids : num [1:9] 1000 3000 5000 7000 9000 11000 13000 15000 17000
## $ xname : chr "islands"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"
##
## hist> hist(sqrt(islands), breaks = 12, col = "lightblue", border = "pink")
##
## hist> ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:
## hist> r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),
## hist+ col = "blue1")
##
## hist> text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = "blue3")
##
## hist> sapply(r[2:3], sum)
## counts density
## 48.000000 0.215625
##
## hist> sum(r$density * diff(r$breaks)) # == 1
## [1] 1
##
## hist> lines(r, lty = 3, border = "purple") # -> lines.histogram(*)
##
## hist> par(op)
##
## hist> require(utils) # for str
##
## hist> str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks
## List of 6
## $ breaks : num [1:10] 0 2000 4000 6000 8000 10000 12000 14000 16000 18000
## $ counts : int [1:9] 41 2 1 1 1 1 0 0 1
## $ density : num [1:9] 4.27e-04 2.08e-05 1.04e-05 1.04e-05 1.04e-05 ...
## $ mids : num [1:9] 1000 3000 5000 7000 9000 11000 13000 15000 17000
## $ xname : chr "islands"
## $ equidist: logi TRUE
## - attr(*, "class")= chr "histogram"
##
## hist> str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))
## List of 6
## $ breaks : num [1:7] 12 20 36 80 200 1000 17000
## $ counts : int [1:6] 12 11 8 6 4 7
## $ density : num [1:6] 0.03125 0.014323 0.003788 0.001042 0.000104 ...
## $ mids : num [1:6] 16 28 58 140 600 9000
## $ xname : chr "islands"
## $ equidist: logi FALSE
## - attr(*, "class")= chr "histogram"
##
## hist> hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,
## hist+ main = "WRONG histogram") # and warning
## Warning in plot.histogram(r, freq = freq1, col = col, border = border, angle =
## angle, : the AREAS in the plot are wrong -- rather use 'freq = FALSE'
##
## hist> ## No test: ##D
## hist> ##D ## Extreme outliers; the "FD" rule would take very large number of 'breaks':
## hist> ##D XXL <- c(1:9, c(-1,1)*1e300)
## hist> ##D hh <- hist(XXL, "FD") # did not work in R <= 3.4.1; now gives warning
## hist> ##D ## pretty() determines how many counts are used (platform dependently!):
## hist> ##D length(hh$breaks) ## typically 1 million -- though 1e6 was "a suggestion only"
## hist> ## End(No test)
## hist> require(stats)
##
## hist> set.seed(14)
##
## hist> x <- rchisq(100, df = 4)
##
## hist> ## Don't show:
## hist> op <- par(mfrow = 2:1, mgp = c(1.5, 0.6, 0), mar = .1 + c(3,3:1))
##
## hist> ## End(Don't show)
## hist> ## Comparing data with a model distribution should be done with qqplot()!
## hist> qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)
##
## hist> ## if you really insist on using hist() ... :
## hist> hist(x, freq = FALSE, ylim = c(0, 0.2))
##
## hist> curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)
##
## hist> ## Don't show:
## hist> par(op)
##
## hist> ## End(Don't show)
## hist>
## hist>
## hist>
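The Details and Value sections can be checked against the mpg data used earlier. The sketch below compares the Sturges suggestion with the number of cells hist() actually uses, and confirms that the density integrates to one; the variable names are chosen here.

```r
x <- mtcars$mpg

# Sturges' suggestion: ceiling(log2(n) + 1) cells.
k <- nclass.Sturges(x)

# The suggestion is passed through pretty(), so the actual count may differ.
h_default <- hist(x, plot = FALSE)
n_cells   <- length(h_default$counts)

# Total area of the bars on the density scale.
area <- sum(h_default$density * diff(h_default$breaks))

print(c(suggested = k, used = n_cells, area = area))
```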
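head(ToothGrowth)
# supp is a factor, so plot() draws a box plot of len for each level
plot(ToothGrowth$supp, ToothGrowth$len)
# The formula interface of boxplot() gives an equivalent plot
boxplot(len ~ supp, data = ToothGrowth)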
# ggplot2
ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot()
# ggplot2
ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
  geom_boxplot()
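A base R counterpart of the interaction() plot can be sketched with the formula interface of boxplot(). The object name `b` is introduced here; its components are the ones boxplot() returns.

```r
# interaction() can be used directly as the grouping term of the formula.
b <- boxplot(len ~ interaction(supp, dose), data = ToothGrowth,
             xlab = "supp.dose", ylab = "len")

print(b$names)  # group labels
print(b$n)      # number of observations per group
```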
## S3 method for class 'formula'
boxplot(formula, data = NULL, ..., subset, na.action = NULL,
xlab = mklab(y_var = horizontal),
ylab = mklab(y_var =!horizontal),
add = FALSE, ann = !add, horizontal = FALSE,
drop = FALSE, sep = ".", lex.order = FALSE)
## Default S3 method:
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
notch = FALSE, outline = TRUE, names, plot = TRUE,
border = par("fg"), col = NULL, log = "",
pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
ann = !add, horizontal = FALSE, add = FALSE, at = NULL)
Arguments
formula
a formula, such as y ~ grp, where y is a numeric vector of data values to be split into groups according to the grouping variable grp (usually a factor). Note that ~ g1 + g2 is equivalent to g1:g2.
data
a data.frame (or list) from which the variables in formula should be taken.
subset
an optional vector specifying a subset of observations to be used for plotting.
na.action
a function which indicates what should happen when the data contain NAs. The default is to ignore missing values in either the response or the group.
xlab, ylab
x- and y-axis annotation, since R 3.6.0 with a non-empty default. Can be suppressed by ann=FALSE.
ann
logical indicating if axes should be annotated (by xlab and ylab).
drop, sep, lex.order
passed to split.default, see there.
x
for specifying data from which the boxplots are to be produced. Either a numeric vector, or a single list containing such vectors. Additional unnamed arguments specify further data as separate vectors (each corresponding to a component boxplot). NAs are allowed in the data.
...
For the formula method, named arguments to be passed to the default method.
For the default method, unnamed arguments are additional data vectors (unless x is a list when they are ignored), and named arguments are arguments and graphical parameters to be passed to bxp in addition to the ones given by argument pars (and override those in pars). Note that bxp may or may not make use of graphical parameters it is passed: see its documentation.
range
this determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
width
a vector giving the relative widths of the boxes making up the plot.
varwidth
if varwidth is TRUE, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.
notch
if notch is TRUE, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al, 1983, p. 62). See boxplot.stats for the calculations used.
outline
if outline is not true, the outliers are not drawn (as points whereas S+ uses lines).
names
group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).
boxwex
a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.
staplewex
staple line width expansion, proportional to box width.
outwex
outlier line width expansion, proportional to box width.
plot
if TRUE (the default) then a boxplot is produced. If not, the summaries which the boxplots are based on are returned.
border
an optional vector of colors for the outlines of the boxplots. The values in border are recycled if the length of border is less than the number of plots.
col
if col is non-null it is assumed to contain colors to be used to colour the bodies of the box plots. By default they are in the background colour.
log
character indicating if x or y or both coordinates should be plotted in log scale.
pars
a list of (potentially many) more graphical parameters, e.g., boxwex or outpch; these are passed to bxp (if plot is true); for details, see there.
horizontal
logical indicating if the boxplots should be horizontal; default FALSE means vertical boxes.
add
logical, if true add boxplot to current plot.
at
numeric vector giving the locations where the boxplots should be drawn, particularly when add = TRUE; defaults to 1:n where n is the number of boxes.
Details
The generic function boxplot currently has a default method (boxplot.default) and a formula interface (boxplot.formula).
If multiple groups are supplied either as multiple arguments or via a formula, parallel boxplots will be plotted, in the order of the arguments or the order of the levels of the factor (see factor).
Missing values are ignored when forming boxplots.
Value
List with the following components:
stats
a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.
n
a vector with the number of observations in each group.
conf
a matrix where each column contains the lower and upper extremes of the notch.
out
the values of any data points which lie beyond the extremes of the whiskers.
group
a vector of the same length as out whose elements indicate to which group the outlier belongs.
names
a vector of names for the groups.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Wadsworth & Brooks/Cole.
Murrell, P. (2005). R Graphics. Chapman & Hall/CRC Press.
See also boxplot.stats.
See Also
boxplot.stats which does the computation, bxp for the plotting and more examples; and stripchart for an alternative (with small data sets).
example(boxplot)
##
## boxplt> ## boxplot on a formula:
## boxplt> boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
##
## boxplt> # *add* notches (somewhat funny here <--> warning "notches .. outside hinges"):
## boxplt> boxplot(count ~ spray, data = InsectSprays,
## boxplt+ notch = TRUE, add = TRUE, col = "blue")
## Warning in bxp(list(stats = structure(c(7, 11, 14, 18.5, 23, 7, 12, 16.5, : some
## notches went outside hinges ('box'): maybe set notch=FALSE
##
## boxplt> boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque",
## boxplt+ log = "y")
##
## boxplt> ## horizontal=TRUE, switching y <--> x :
## boxplt> boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque",
## boxplt+ log = "x", horizontal=TRUE)
##
## boxplt> rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque")
##
## boxplt> title("Comparing boxplot()s and non-robust mean +/- SD")
##
## boxplt> mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)
##
## boxplt> sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)
##
## boxplt> xi <- 0.3 + seq(rb$n)
##
## boxplt> points(xi, mn.t, col = "orange", pch = 18)
##
## boxplt> arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,
## boxplt+ code = 3, col = "pink", angle = 75, length = .1)
##
## boxplt> ## boxplot on a matrix:
## boxplt> mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),
## boxplt+ `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))
##
## boxplt> boxplot(mat) # directly, calling boxplot.matrix()
##
## boxplt> ## boxplot on a data frame:
## boxplt> df. <- as.data.frame(mat)
##
## boxplt> par(las = 1) # all axis labels horizontal
##
## boxplt> boxplot(df., main = "boxplot(*, horizontal = TRUE)", horizontal = TRUE)
##
## boxplt> ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :
## boxplt> boxplot(len ~ dose, data = ToothGrowth,
## boxplt+ boxwex = 0.25, at = 1:3 - 0.2,
## boxplt+ subset = supp == "VC", col = "yellow",
## boxplt+ main = "Guinea Pigs' Tooth Growth",
## boxplt+ xlab = "Vitamin C dose mg",
## boxplt+ ylab = "tooth length",
## boxplt+ xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")
##
## boxplt> boxplot(len ~ dose, data = ToothGrowth, add = TRUE,
## boxplt+ boxwex = 0.25, at = 1:3 + 0.2,
## boxplt+ subset = supp == "OJ", col = "orange")
##
## boxplt> legend(2, 9, c("Ascorbic acid", "Orange juice"),
## boxplt+ fill = c("yellow", "orange"))
##
## boxplt> ## With less effort (slightly different) using factor *interaction*:
## boxplt> boxplot(len ~ dose:supp, data = ToothGrowth,
## boxplt+ boxwex = 0.5, col = c("orange", "yellow"),
## boxplt+ main = "Guinea Pigs' Tooth Growth",
## boxplt+ xlab = "Vitamin C dose mg", ylab = "tooth length",
## boxplt+ sep = ":", lex.order = TRUE, ylim = c(0, 35), yaxs = "i")
##
## boxplt> ## more examples in help(bxp)
curve(x^3 - 5*x, from = -4, to = 4)
myfun <- function(xvar) {
  # logistic curve centred at xvar = 10
  1 / (1 + exp(-xvar + 10))
}
curve(myfun(x), from = 0, to = 20)
curve(1 - myfun(x), add = TRUE, col = "red")
p <- ggplot(data.frame(x = c(0,20)), aes(x = x))
p <- p + stat_function(fun = myfun, geom = "line", color = "black")
p + stat_function(fun = function(t) 1 - myfun(t), geom = "line", color = "red")
Description: Draws a curve corresponding to a function over the interval [from, to]. curve can plot also an expression in the variable xname, default x.
Usage:
curve(expr, from = NULL, to = NULL, n = 101, add = FALSE, type = "l", xname = "x", xlab = xname, ylab = NULL, log = NULL, xlim = NULL, ...)
## S3 method for class 'function'
plot(x, y = 0, to = 1, from = y, xlim = NULL, ylab = NULL, ...)
Arguments
expr
The name of a function, or a call or an expression written as a function of x which will evaluate to an object of the same length as x.
x
a ‘vectorizing’ numeric R function.
y
alias for from for compatibility with plot
from, to
the range over which the function will be plotted.
n
integer; the number of x values at which to evaluate.
add
logical; if TRUE add to an already existing plot; if NA start a new plot taking the defaults for the limits and log-scaling of the x-axis from the previous plot. Taken as FALSE (with a warning if a different value is supplied) if no graphics device is open.
xlim
NULL or a numeric vector of length 2; if non-NULL it provides the defaults for c(from, to) and, unless add = TRUE, selects the x-limits of the plot – see plot.window.
type
plot type: see plot.default.
xname
character string giving the name to be used for the x axis.
xlab, ylab, log, ...
labels and graphical parameters can also be specified as arguments. See ‘Details’ for the interpretation of the default for log.
For the "function" method of plot, ... can include any of the other arguments of curve, except expr.
Details
The function or expression expr (for curve) or function x (for plot) is evaluated at n points equally spaced over the range [from, to]. The points determined in this way are then plotted.
If either from or to is NULL, it defaults to the corresponding element of xlim if that is not NULL.
What happens when neither from/to nor xlim specifies both x-limits is a complex story. For plot(<function>) and for curve(add = FALSE) the defaults are (0, 1). For curve(add = NA) and curve(add = TRUE) the defaults are taken from the x-limits used for the previous plot. (This differs from versions of R prior to 2.14.0.)
The value of log is used both to specify the plot axes (unless add = TRUE) and how ‘equally spaced’ is interpreted: if the x component indicates log-scaling, the points at which the expression or function is plotted are equally spaced on log scale.
The default value of log is taken from the current plot when add = TRUE, whereas if add = NA the x component is taken from the existing plot (if any) and the y component defaults to linear. For add = FALSE the default is "".
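The evaluation grid described above can be checked directly, because curve returns the points it drew. This is a minimal sketch with a deliberately small n; the names lin, lg and def are illustrative.

```r
# Linear axis: n points equally spaced on [from, to].
lin <- curve(log(1 + x), from = 1, to = 100, n = 5)
lin$x   # 1.00 25.75 50.50 75.25 100.00

# log = "x": the same number of points, now equally spaced on the log scale.
lg <- curve(log(1 + x), from = 1, to = 100, n = 5, log = "x")
lg$x    # powers of 10 in steps of one half: 10^0, 10^0.5, ..., 10^2

# Neither from/to nor xlim given and add = FALSE: the range defaults to (0, 1).
def <- curve(x^2, n = 5)
range(def$x)
```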
This used to be a quick hack which now seems to serve a useful purpose, but can give bad results for functions which are not smooth.
For expensive-to-compute expressions, you should use smarter tools.
The way curve handles expr has caused confusion. It first looks to see if expr is a name (also known as a symbol), in which case it is taken to be the name of a function, and expr is replaced by a call to expr with a single argument with name given by xname. Otherwise it checks that expr is either a call or an expression, and that it contains a reference to the variable given by xname (using all.vars): anything else is an error. Then expr is evaluated in an environment which supplies a vector of name given by xname of length n, and should evaluate to an object of length n. Note that this means that curve(x, ...) is taken as a request to plot a function named x (and it is used as such in the function method for plot).
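The rules for expr are easier to follow with a few calls side by side. The sketch below uses sin only as a convenient example; the names by_name, by_call, by_t and err are illustrative.

```r
# A bare name is taken to be the name of a function.
by_name <- curve(sin, from = 0, to = pi, n = 11)

# A call must reference the variable named by xname (default "x").
by_call <- curve(sin(x), from = 0, to = pi, n = 11)

# A different variable name is accepted once xname says so.
by_t <- curve(sin(t), from = 0, to = pi, n = 11, xname = "t")

# All three describe the same curve.
all.equal(by_name$y, by_call$y)
all.equal(by_name$y, by_t$y)

# Without xname = "t" the call does not mention "x", so curve() stops.
err <- tryCatch(curve(sin(t), from = 0, to = pi), error = function(e) e)
inherits(err, "error")
```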
The plot method can be called directly as plot.function.
Value
A list with components x and y of the points that were drawn is returned invisibly.
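Because the result is returned invisibly, it has to be assigned to be seen. A minimal sketch, reusing the cubic from the earlier curve call (pts and df are illustrative names):

```r
# Draw the curve and keep the points that were plotted.
pts <- curve(x^3 - 5*x, from = -4, to = 4)

names(pts)      # "x" "y"
length(pts$x)   # 101, the default n

# The list converts directly to a data frame, e.g. for reuse with ggplot2.
df <- as.data.frame(pts)
head(df)
```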
Warning
For historical reasons, add is allowed as an argument to the "function" method of plot, but its behaviour may surprise you. It is recommended to use add only with curve.
See Also
splinefun for spline interpolation, lines.
example(curve)
##
## curve> plot(qnorm) # default range c(0, 1) is appropriate here,
##
## curve> # but end values are -/+Inf and so are omitted.
## curve> plot(qlogis, main = "The Inverse Logit : qlogis()")
##
## curve> abline(h = 0, v = 0:2/2, lty = 3, col = "gray")
##
## curve> curve(sin, -2*pi, 2*pi, xname = "t")
##
## curve> curve(tan, xname = "t", add = NA,
## curve+ main = "curve(tan) --> same x-scale as previous plot")
##
## curve> op <- par(mfrow = c(2, 2))
##
## curve> curve(x^3 - 3*x, -2, 2)
##
## curve> curve(x^2 - 2, add = TRUE, col = "violet")
##
## curve> ## simple and advanced versions, quite similar:
## curve> plot(cos, -pi, 3*pi)
##
## curve> curve(cos, xlim = c(-pi, 3*pi), n = 1001, col = "blue", add = TRUE)
##
## curve> chippy <- function(x) sin(cos(x)*exp(-x/2))
##
## curve> curve(chippy, -8, 7, n = 2001)
##
## curve> plot (chippy, -8, -5)
##
## curve> for(ll in c("", "x", "y", "xy"))
## curve+ curve(log(1+x), 1, 100, log = ll, sub = paste0("log = '", ll, "'"))
##
## curve> par(op)
op <- par(mfrow = c(1,2))
barplot(BOD$demand, names.arg = BOD$Time, main = "Specifying bar labels",
        xlab = "Time elapsed (days)", ylab = "Oxygen demand (mg/l)")
hist(mtcars$mpg, breaks = 10)
par(op)
str(BOD)
## 'data.frame': 6 obs. of 2 variables:
## $ Time : num 1 2 3 4 5 7
## $ demand: num 8.3 10.3 19 16 15.6 19.8
## - attr(*, "reference")= chr "A1.4, p. 270"
head(BOD)
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
head(cars)
str(longley)
## 'data.frame': 16 obs. of 7 variables:
## $ GNP.deflator: num 83 88.5 88.2 89.5 96.2 ...
## $ GNP : num 234 259 258 285 329 ...
## $ Unemployed : num 236 232 368 335 210 ...
## $ Armed.Forces: num 159 146 162 165 310 ...
## $ Population : num 108 109 110 111 112 ...
## $ Year : int 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 ...
## $ Employed : num 60.3 61.1 60.2 61.2 63.2 ...
head(longley)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars)
str(pressure)
## 'data.frame': 19 obs. of 2 variables:
## $ temperature: num 0 20 40 60 80 100 120 140 160 180 ...
## $ pressure : num 0.0002 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...
head(pressure)
str(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
head(as.data.frame(Titanic))
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth)